Abstract

The goal of this work is to bring semantics into the tasks
of text recognition and retrieval in natural images. Although
text recognition and retrieval have received a lot of
attention in recent years, previous works have focused on
recognizing or retrieving exactly the same word used as a
query, without taking the semantics into consideration.

In this paper, we ask the following question: can we predict
semantic concepts directly from a word image, without
explicitly trying to transcribe the word image or its
characters at any point? For this goal we propose a
convolutional neural network (CNN) with a weighted ranking
loss objective that ensures that the concepts relevant to the
query image are ranked ahead of those that are not relevant.
This can also be interpreted as learning a Euclidean
space where word images and concepts are jointly embedded.
This model is learned in an end-to-end manner, from
image pixels to semantic concepts, using a dataset of
synthetically generated word images and concepts mined from
a lexical database (WordNet). Our results show that, despite
the complexity of the task, word images and concepts
can indeed be associated with a high degree of accuracy.


1. Introduction

In recent years there has been an increased interest in
tasks related to text recognition and retrieval in natural
images [15, 27]. For example, given an image of a word, one
may be interested in recognizing the word, either using a list
of possible transcriptions [4, 10, 27] or in an unconstrained
manner [6, 12]. There has also been a growing interest in
word image retrieval: given a query, which can be either a
text string or another word image, one tries to retrieve the
relevant word images in a dataset [4, 10].

In all these cases, the goal has been to retrieve or
recognize exactly the same word used as a query, without
taking the semantics into consideration. For example,
given a query image with the word phoenix, it would be
transcribed as phoenix, without any consideration of its
meaning. Similarly, using the text string restaurant as
a query would only retrieve images containing this word
(see Figure 1 top).

Figure 1: Comparison of standard scene text recognition
and retrieval (top), and the proposed word image
understanding tasks (bottom). Strings in quotes represent text
strings while strings in bounding boxes represent concepts.

In contrast, in this paper we are interested in the problem
of word image understanding, i.e., we wish to bring semantics
into the tasks of word image recognition and retrieval.
For example, we would like to capture the semantic meanings
of the word phoenix as both a city and a state capital,
and also its semantic meaning as a mythical being (see
Figure 1 bottom). Semantics play a very important role in scene
understanding and for scene text, particularly in urban
scenarios, they will allow one to perform tasks beyond simple
lexical matching. To illustrate this, let us take the example
of a system which would parse a street scene and, in particular,
classify building faces into different business
classes such as restaurants, hotels, banks, etc. While the
presence of a sign pizzeria is indicative of a restaurant,
the mere transcription of the text in the sign is not sufficient
in itself to deduce this. Additional reasoning capabilities
enabled by an understanding of the semantics of the word
are required to make the classification decision.

A straightforward, two-step approach to achieving this
goal would be to first transcribe the word image, and then
match the transcriptions to the semantic concepts. The
transcriptions could be matched using lexical databases of
English such as WordNet [1], which contain a hierarchy of words
annotated with semantic concepts. In this two-step approach,
the word-image recognition step can be understood
as one of extracting mid-level features (the transcriptions),
which are then fed to a second classification step.

However, this approach has significant shortcomings.
First, it relies on an accurate transcription of word images.
Although the state of the art in word image recognition has
advanced significantly in recent years [4, 10, 11, 12],
the results are still not perfect, particularly when word
images are not cropped exactly. This is a very typical scenario
in end-to-end word recognition, where one first has to
localize the word in the image, crop it, and then recognize it.
Second, the approach cannot deal with out-of-vocabulary
words. Even if a word is transcribed correctly, if the word
does not appear in the lexical resource, it will not be
possible to assign concepts to it. Finally, this approach does
not lead to a compact representation of word images that
encodes semantics. Such a representation is desirable as it
could be used as an input feature for other tasks such as
clustering word images that share semantics, or searching
among word images using a semantic concept as a query
(see Figure 7 for an example).

In this paper, we ask the following question: can we predict
semantic concepts directly from a word image without
explicitly trying to transcribe the word image or its
characters at any point? While this might sound hopeless,
because different word images corresponding to the
same concept may have widely varying appearances (see
the restaurant example in Figure 1 bottom), we show
that, surprisingly, this is indeed possible. For this goal we
propose to use a convolutional neural network (CNN) [18]
with a weighted ranking loss objective [28] that ensures that
the concepts relevant to the query image are ranked ahead of
those that are not relevant. This model is learned in an
end-to-end manner, from image pixels to semantic concepts.
Importantly, one can interpret this learned architecture as
a way to embed word images and concepts in a common,
latent subspace (see Figure 2). In particular, the weights of
the last layer of the network can be seen as a transductive
embedding of the semantic concepts: to add new concepts,
one would need to retrain or fine-tune the network. On the
other hand, the activations of the previous-to-last layer of
the network can be seen as an inductive embedding of the
input word images: word images containing words that have
not been observed during training can still be embedded in
this space and matched with known concepts. Hence, we
refer to our approach as Latent Embeddings for Word Images
and their Semantics, or LEWIS for short.

LEWIS addresses the problems of the straightforward
two-step approach: it does not require one to extract the
transcription of word images explicitly, it allows one to
retrieve concepts from word images not seen during training
or that do not appear in our source of semantic concepts
(akin to a zero-shot learning task), and it provides a feature
vector representation of the word image that encodes
its semantics rather than only its lexical information. As
we will show in later sections, this allows one to go beyond
predicting semantic categories from word images and
perform additional tasks such as searching word images using
a concept as a query, or retrieving word images that share
concepts with a query word image, even when both images
depict different words.

Figure 2: Outline of our approach. Our goal is to learn two
embedding functions $\phi : \mathcal{I} \to \mathbb{R}^D$ and
$\psi : \mathcal{C} \to \mathbb{R}^D$
that embed images and concepts in a common subspace,
where embedded images should be closer to the embedded
concepts they are related to than to the concepts they are
not related to. This is learned in an end-to-end manner
with a convolutional neural network and a ranking loss.

In summary, the contributions of the paper are four-fold.
First, we introduce a new task to the computer vision
community: predicting semantic categories of word images.
Second, we enrich an existing large dataset of word images
with semantic transcriptions derived from WordNet
[1] to evaluate the proposed problem; we plan to make
these annotations available to the community. Third, to
solve the proposed problem we introduce LEWIS, a solution
based on a convolutional architecture which does not
involve transcribing the word image and which embeds both
word images and semantic concepts in a latent common
subspace. Fourth, we show experimentally that LEWIS
performs comparably or better in terms of accuracy than
a two-step approach that uses state-of-the-art word recognition
techniques, while offering many other advantages.

The rest of the paper is organized as follows. Section 2
describes the related work. Section 3 introduces our
method. In Section 4 we describe our experimental evaluation
and discuss the results. Finally, Section 5 concludes
the paper.

2. Related Work 

We review the works most closely related to ours: those related
to word image representations (embeddings), textual
embeddings, and joint image and semantic embeddings.

Word image representations. Deriving suitable representations
of word images for tasks such as recognition and
retrieval in document images has been an active topic of
research in the document analysis community for many years.
However, only during recent years has that interest also
embraced word image representations in natural images,
commonly referred to as "scene text" [4, 6, 10, 11, 13, 20, 21,
22, 24, 25, 26, 29].

Many of these works focus on localizing the individual
characters of the word image. Then, one may recognize
the characters independently to produce a transcription
[6, 22], or define a compatibility function between the
character probabilities and text strings (using e.g. conditional
random fields) and rank all possible words in a text
dictionary or lexicon to find the most likely transcriptions
[13, 20, 21, 26, 29]. These approaches do not explicitly
construct a feature representation of the word image,
and their uses beyond recognition are limited. In contrast,
some recent works [4, 10, 24, 25] focus on obtaining a feature
representation of the image using standard computer
vision representations, without explicitly localizing its
characters, and learning a compatibility function between these
feature vector representations and embedded text strings.

Along a slightly different line, and closely related to our
approach, Jaderberg et al. [11] learn to classify word images
into a set of 90,000 possible transcriptions. This is achieved
in an end-to-end manner using Convolutional Neural Networks
(CNNs) and synthetic training data, and the approach
obtained outstanding recognition results on standard
benchmarks. Interestingly, the activations of the previous-to-last
layer of the network can also be used as word image features
for retrieval purposes.

All these previous works focus solely on lexical similarities,
and do not capture any semantics. In contrast, we
focus on capturing the semantic information of the words,
and not simply information about the word transcription.

The most closely related work to ours is that of Krishnan
and Jawahar [16], which aims at performing word image
retrieval preserving semantics, and, particularly, synonyms.
This is very similar to the two-step baseline discussed
in the previous section. Unlike our approach, [16] does
not learn any joint space for images and semantics, and relies
on query expansion using a dataset of images annotated
with their synonyms.

Text embeddings. There has been a recent resurgence of
interest in embedding text in semantic Euclidean spaces in
the natural language processing community. Examples of
such works include Word2Vec [19] and GloVe [23]. This is
achieved by unsupervised training on large corpora of text
such as Wikipedia. However, these approaches focus on
embedding text strings, and not word images.


Images and their semantics. Several works have considered
the problem of jointly embedding images and semantic
categories in an intermediate Euclidean space. A simple
way to do so is to perform a Canonical Correlation Analysis
(CCA) on image representations and their tags [9]. Weston
et al. [28] proposed WSABIE, which can be understood as
a neural architecture with a single hidden layer. The WSABIE
objective function is a weighted ranking loss. We use a
similar loss to learn our joint word-image and semantic concept
embedding. An issue with WSABIE is that it cannot
deal with zero-shot recognition. To address this problem,
Frome et al. [8] proposed DeViSE, an embedding model that
learns to map natural images to text embeddings learned
with Word2Vec. Other recent works also used text embeddings
as output embeddings [2, 3], but focusing only on natural
images. By contrast, we do not leverage the Word2Vec
representations of text, and rely on the graph taxonomy
provided by WordNet [1] to learn our embeddings.

3. Learning latent embeddings 

We start by describing how we mine WordNet for semantic
concepts. Then we describe our approach to ranking
those concepts given an image, and how it can be understood
as an embedding of word images and concepts in a
common latent subspace.

3.1. Mining WordNet for Semantic Concepts 

WordNet [1] is a lexical database for the English language.
Words are organized into groups of synonyms called
synsets, and these groups in turn are organized in a hierarchical
manner using different semantic relations. One of
these types of relations is hypernymy: Y is a hypernym of
X if X is a kind of Y. For example, the word glass has several
hypernyms, two of which are solid (when glass
is a material) and container. Therefore, given a word,
one can find the synset or synsets (if the word has several
meanings) to which it belongs, and then climb through the
hypernym hierarchy until a root is found. As an example,
paths for the words jeep, cat, and dinosaur are shown
in Figure 3, where the number within brackets indicates the
depth level of the hierarchy.

In our work, we leverage these hierarchies to produce semantic
annotations of words: given a word in our dataset,
we first produce the set of synsets to which it belongs.
Each synset in this set corresponds to a different, fine-grained,
semantic meaning of the word. Then, for each
synset in the set, we ascend the hypernym hierarchy, producing
increasingly generic concepts with which we annotate
the word. Annotating words with all of their hypernyms
would produce tens of thousands of concepts, some
of them very fine-grained (e.g. goblet), while others
are extremely generic (e.g. entity). Instead, we
collect only the concepts at a given depth level, controlling
the granularity of the concepts. For example, when
choosing level 9, cat would be labeled as mammal and
dinosaur as reptile, while at level 8 both would be
labeled as vertebrate. We evaluate and compare the
results for different choices of the depth level.

Figure 3: A section of the WordNet hierarchy showing three
words and their hypernyms up to the root. Note that cat
and dinosaur would be given the same label for depth
level 8 and above, but different labels otherwise. On the
other hand, jeep and dinosaur would not share concepts
until reaching depth level 3.

This annotation approach still produces several thousand
classes, some of which are very populated while
others contain as few as one single word. For evaluation
purposes, we will only annotate words with the K most
populated classes and evaluate the effect of changing the
value of K.
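The mining procedure above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used in the paper: the hypernym paths in `TOY_PATHS` are hand-written toy data that loosely follow WordNet, the root is assumed to sit at depth 0, and the helper names (`concepts_at_level`, `top_k_concepts`, `annotate`) are hypothetical.

```python
from collections import Counter

# Toy hypernym paths (root first), one list per meaning of the word.
# Hand-written for illustration; they only loosely follow WordNet.
# Assumption: the list index is the depth level, with the root at 0.
TOY_PATHS = {
    "cat": [["entity", "physical_entity", "object", "whole",
             "living_thing", "organism", "animal", "chordate",
             "vertebrate", "mammal", "placental", "carnivore",
             "feline", "cat"]],
    "dinosaur": [["entity", "physical_entity", "object", "whole",
                  "living_thing", "organism", "animal", "chordate",
                  "vertebrate", "reptile", "diapsid", "archosaur",
                  "dinosaur"]],
    "jeep": [["entity", "physical_entity", "object", "whole",
              "artifact", "instrumentality", "conveyance", "vehicle",
              "wheeled_vehicle", "self_propelled_vehicle",
              "motor_vehicle", "car", "jeep"]],
}


def concepts_at_level(word, level, paths=TOY_PATHS):
    """Concepts found at depth `level`, over all meanings of `word`."""
    return {p[level] for p in paths.get(word, []) if len(p) > level}


def top_k_concepts(words, level, k, paths=TOY_PATHS):
    """The k concepts at `level` that are assigned to the most words."""
    counts = Counter(c for w in words
                     for c in concepts_at_level(w, level, paths))
    return [c for c, _ in counts.most_common(k)]


def annotate(words, level, k, paths=TOY_PATHS):
    """Label each word with its concepts among the top k; drop the rest."""
    kept = set(top_k_concepts(words, level, k, paths))
    labels = {w: concepts_at_level(w, level, paths) & kept for w in words}
    # Words left without any concept are ignored.
    return {w: c for w, c in labels.items() if c}
```

With this toy data, `annotate(["cat", "dinosaur", "jeep"], level=8, k=1)` keeps only the most populated concept at level 8, so jeep is dropped while cat and dinosaur share a label.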

It is worth noting that, although we base our evaluation 
on semantic concepts extracted from WordNet, this is not a 
requirement, and other sources of semantic annotations can 
be exploited by our method. 

3.2. Ranking semantic concepts 

Motivated by the overwhelming success of convolutional
neural networks (CNN) [18] for image classification [17]
and word image recognition [11], we adopt a similar
architecture. We follow Jaderberg et al. [11] and use 5
convolutional layers followed by 3 fully connected layers. In our
case the output dimensionality of the last fully connected
layer is the number of semantic concepts K. Please refer to
Section 4.2 for more details about the network architecture.

A significant difference from [17, 11] is that these works
address a single-label classification problem while we
consider a multi-label problem, since multiple concepts can be
assigned to an image. Hence, we cannot adopt the standard
classification objective which involves computing the
cross-entropy between the network output and the ground
truth label. Instead, we make use of a ranking framework
which we now explain.


Let us assume a set of N training images
$\{X_1, X_2, \ldots, X_N\}$, and a set of K semantic concepts
$\{C_1, C_2, \ldots, C_K\}$ produced as described in the previous
section. Let us also assume that, for training purposes,
each image is annotated with at least one semantic concept.
As the transcriptions of the word images are available at
training time, this can be achieved by propagating to each
image the concepts that are relevant to its transcription.
Let us denote by $r(X)$ the set of concept indexes that are
relevant to image X and by $\bar{r}(X)$ its complement.

Given this setup, we are interested in finding a compatibility
function F between images and concepts such that
the number of non-relevant concepts that are ranked ahead
of relevant concepts is minimized. Given an image X, the
last fully connected layer of the architecture produces a
prediction vector $Y \in \mathbb{R}^K$, where $Y_i$ represents the
predicted compatibility between the image and concept $C_i$, i.e.
$Y_i = F(X, C_i)$. A possible ranking objective is to enforce:

$$\min_F \sum_{X} \sum_{p \in r(X)} \sum_{n \in \bar{r}(X)} I[F(X, C_n) > F(X, C_p)], \qquad (1)$$

where $I[\mathrm{cond}]$ is the indicator function that evaluates to 1
when cond is true and to 0 otherwise. However, optimizing
Equation (1) directly is not feasible due to the indicator
function, and instead we choose a differentiable surrogate.
In particular, we choose the weighted approximately ranked
pairwise loss (WARP) of Weston et al. [28]. This ranking
loss places more emphasis on the top of the ranked list,
leading to superior results under many ranking metrics.

Given two concepts $p \in r(X)$ and $n \in \bar{r}(X)$, their WARP
loss is computed as

$$\ell(X, p, n) = L(\mathrm{rank}(p)) \cdot \max(0, 1 - Y_p + Y_n). \qquad (2)$$

Here, $\mathrm{rank}(p)$ denotes the ranked position of p, i.e., how
many concepts obtained a better score than $C_p$, while $L(r)$
is a loss function of the form:

$$L(r) = \sum_{j=1}^{r} \alpha_j, \quad \text{with } \alpha_1 \geq \alpha_2 \geq \ldots \geq 0, \qquad (3)$$

where different choices of the $\alpha_j$ coefficients lead to the
optimization of different measures, and where $\alpha_j = 1/j$
puts special emphasis on the first results, leading to superior
top K accuracy and mean average precision [28].
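As a small worked example of the pairwise WARP loss, the sketch below evaluates it for a given score vector with the harmonic weights 1/j. It is a plain-Python illustration under the definitions in the text (the rank of p is the number of concepts scoring strictly higher than p); the function names are hypothetical and this is not the training code of the paper.

```python
def rank_weight(r):
    """L(r): sum of alpha_j for j = 1..r, with alpha_j = 1/j."""
    return sum(1.0 / j for j in range(1, r + 1))


def rank_of(scores, p):
    """Number of concepts whose score is strictly higher than that of p."""
    return sum(1 for s in scores if s > scores[p])


def warp_pair_loss(scores, p, n):
    """Rank-weighted hinge loss for a positive p and a negative n."""
    hinge = max(0.0, 1.0 - scores[p] + scores[n])
    return rank_weight(rank_of(scores, p)) * hinge
```

For instance, with scores `[0.2, 1.5, 0.9, -0.3]`, positive `p = 0` and negative `n = 1`, two concepts outrank the positive, so the weight is 1 + 1/2 = 1.5 and the hinge term is 1 - 0.2 + 1.5 = 2.3, giving a loss of 3.45. A positive that is already ranked first has rank 0 and therefore incurs no loss.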

Computing the loss over all possible pairs of p and n
may be prohibitively expensive. Instead, given an image
and a positive category (or concept in our case), one
typically samples negative categories until finding one which
produces a positive loss, and uses that for the update.
Similarly, computing the exact rank of p is expensive if K is
not small. In that case, the rank of p can be estimated as
$\lfloor (K - 1)/s \rfloor$, where s is the number of tries that was needed to
find a negative category with a positive loss. Although this
approximation is rough, particularly for items with multiple
positive labels, it works well in practice [28].

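The subgradient of the loss, needed for the backpropagation
stage of the training, is given by:

$$\frac{\partial \ell(X, p, n)}{\partial Y_i} =
\begin{cases}
-L(\mathrm{rank}(p)) & \text{if } i = p \text{ and } \ell(X, p, n) > 0, \\
\phantom{-}L(\mathrm{rank}(p)) & \text{if } i = n \text{ and } \ell(X, p, n) > 0, \\
\phantom{-}0 & \text{otherwise.}
\end{cases} \qquad (4)$$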
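The sampling scheme and the subgradient can be combined into one update step, sketched below in plain Python. Several points are assumptions made for illustration rather than details taken from the paper: negatives are drawn without replacement (by shuffling), the rank is estimated as floor((K - 1) / s) following WSABIE [28], and the function name `sampled_warp_gradient` is hypothetical.

```python
import random


def sampled_warp_gradient(scores, positives, p, rng=random):
    """Sampled WARP step for positive concept p.

    Returns (loss, grad), where grad[i] is the subgradient of the loss
    with respect to the score Y_i. `positives` is the set r(X).
    """
    K = len(scores)
    negatives = [k for k in range(K) if k not in positives]
    rng.shuffle(negatives)  # assumption: sample without replacement
    grad = [0.0] * K
    for tries, n in enumerate(negatives, start=1):
        hinge = 1.0 - scores[p] + scores[n]
        if hinge > 0.0:
            # Rank estimate from the number of tries s: floor((K - 1) / s).
            est_rank = (K - 1) // tries
            weight = sum(1.0 / j for j in range(1, est_rank + 1))
            grad[p] = -weight  # push the positive score up
            grad[n] = weight   # push the violating negative score down
            return weight * hinge, grad
    # No negative violates the margin: zero loss, zero gradient.
    return 0.0, grad
```

The returned gradient has at most two non-zero entries, which mirrors the three-case form of the subgradient: only the scores of the positive and of the sampled violating negative are updated.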


3.3. Latent embeddings 

It has been shown in several recent works that CNNs can
be used as generic feature extractors, and that these features
are useful for tasks such as classification and retrieval [5, 7].
This is achieved by using the output activations of a given
layer of the network. For example, the output activations
of the last layer produce a task-specific "attributes"
representation that encodes the scores that the image obtains for
each of the classes used during learning. Extracting features
from earlier layers produces more and more generic
features, which are more and more disconnected from the
learning objective [30].

Here we follow a similar idea and use the activations of
the penultimate layer of our architecture (FC7, 4,096
dimensions) as semantic representations of the word images.
However, we also note that the columns of the weight matrix
of the last layer can be seen as embeddings of the semantic
concepts. What is more, because of the way the network
is constructed, the dot product between the word image
embeddings and the concept embeddings is exactly the
compatibility function F between word images and concepts
that we were seeking. If we denote by $\phi(X)$ the activations
of the FC7 layer of the network given image X, and by
$\psi_k$ the k-th column of the weight matrix of the last layer,
then $F(X, C_k) = \phi(X)^T \psi_k$. $\psi_k$ can be understood as a
transductive embedding of concept $C_k$, i.e., we can define
a function $\psi$ which acts as a simple look-up table such that
$\psi(C_k) = \psi_k$. This interpretation gives better insights into
some of the tasks we perform such as querying the image
dataset using a concept as a query, or performing an
image-to-image search, as illustrated in Figure 2.
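The look-up-table view of the concept embeddings can be illustrated with a toy example. The sketch below uses 2-dimensional embeddings instead of the 4,096-dimensional FC7 activations, assumes that the last layer has no bias term so that the score reduces to a plain dot product, and uses hypothetical function names; it is not the paper's implementation.

```python
def dot(u, v):
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def compatibility(phi_x, psi):
    """Scores F(X, C_k) = phi(X)^T psi_k for every concept k."""
    return [dot(phi_x, psi_k) for psi_k in psi]


def rank_concepts(phi_x, psi):
    """Image-to-concept: concept indices sorted by decreasing score."""
    scores = compatibility(phi_x, psi)
    return sorted(range(len(psi)), key=lambda k: -scores[k])


def rank_images(psi_k, image_embeddings):
    """Concept-to-image: image indices sorted by decreasing score."""
    scores = [dot(phi, psi_k) for phi in image_embeddings]
    return sorted(range(len(scores)), key=lambda i: -scores[i])


# Toy concept embeddings (the look-up table psi) and image embeddings.
PSI = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
IMAGES = [[0.9, 0.1], [0.2, 0.8]]
```

Here `rank_concepts(IMAGES[0], PSI)` orders the concepts for the first image, and `rank_images(PSI[1], IMAGES)` orders the images for the second concept; both directions reuse the same dot product, which is what makes the shared space convenient for retrieval.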


4. Experiments 

We start by describing our datasets. We then describe our 
evaluation protocols and baselines and provide quantitative 
as well as qualitative results. 

4.1. Datasets 

We use three publicly available datasets in our experiments.
The first is the Oxford Synthetic Word dataset
[11], a very large dataset that contains 9 million annotated
word images covering a dictionary of approximately 90,000
English words. This dataset has been synthetically generated
by applying realistic distortions to rendered word
images using randomly selected fonts from a catalogue of
1,400 fonts downloaded from Google Fonts. The official
train, validation, and test partitions contain approximately
7.2 million, 800,000 and 820,000 images, respectively.
Despite being a synthetic dataset, models learned with it obtain
outstanding results on real data [11].

In addition, we also evaluate the models learned on Oxford
Synthetic on two other public datasets: the Street View
Text (SVT) dataset [27], which contains a total of 904
cropped word images harvested from Google Street View,
and the IIIT 5k-word (IIIT5K) dataset [20], which contains
5,000 cropped word images from natural and born-digital
images. In both cases we only use the official test
partitions (647 word images in SVT and 3,000 in IIIT5K).

4.2. Implementation details 

To extract the semantic annotations of each word, we
first find the hypernym path to the root for every meaning
of the word, as described in Section 3.1. We then keep only
the concepts at level $l$ of each path. In our experiments, we
evaluate the effect of varying $l$ from 7 to 9. We follow this
approach to extract concepts from the 90,000 words in the
Oxford Synthetic dataset. Concepts are then sorted according
to how many words were assigned to them, and only
the top K most populated concepts are kept. In our
experiments, we change the value of K from 128 up to 1,024. Any
word that was not found in the WordNet database or that is
not assigned to any concept in the top K is ignored, both
at training and at test time. In the most fine-grained case
($l = 9$, K = 128) this leaves us with 9,900 unique words,
and about 820,000 training images and 100,000 testing
images. At the other extreme ($l = 7$, K = 1,024) our dataset
contains 34,153 unique words, 3,000,000 training images,
and 350,000 testing images. The mean number of concepts
assigned to every image ranges from 1.2 to 1.7, and the
maximum number of concepts assigned to a word is 13.

Our CNN architecture replicates the one in Jaderberg et
al. [11], except for the size of the last layer (90,000 in their
case vs. K in our case) and the loss (cross-entropy in their
case, and a WARP ranking loss in our case). In particular,
we use 5 convolutional layers with (64, 128, 256, 512, 512)
kernels of sizes (5, 5, 3, 3, 3) and a stride of 1 pixel. A
max pooling with size 2 and a stride of 2 pixels is applied
after layers 1, 2, and 4. This is followed by three fully
connected layers (FC6, FC7, and FC8) of sizes (4,096, 4,096,
K). A ReLU non-linearity is applied after every convolutional
or fully connected layer. Dropout regularization is
applied right after layers FC6 and FC7 with a drop rate of
0.5. Input images are resized to 32 × 100 pixels without
preserving the aspect ratio, as in [11].

Figure 4: Quantitative results on Oxford Synthetic. The bars represent the accuracy of our proposed LEWIS method, while
the dots represent the accuracy of the two-step baseline. (a), (b): image-to-concept and concept-to-image results on the
original word images, which were accurately cropped. (c), (d): results on random crops of the word images.

(a) image-to-concept (SVT) (b) concept-to-image (SVT) (c) image-to-concept (IIIT5K) (d) concept-to-image (IIIT5K)

Figure 5: Quantitative results on SVT ((a), (b)) and IIIT5K ((c), (d)). The bars represent the accuracy of the proposed LEWIS
method, while the dots represent the accuracy of the two-step baseline.


Learning was done using a modified version of
Caffe [14]. For efficiency reasons, we first learned 3
independent models for $l = 7$, $l = 8$, and $l = 9$, fixing the size
of K to 128, and then we fine-tuned those models to larger
values of K. Learning all the models took approximately 3
weeks using 2 NVIDIA Tesla K40 GPUs.

4.3. Evaluation protocol 

We evaluate our approach on three different tasks. In
image-to-concept retrieval, the goal is to annotate a query
image with one or multiple concepts. This is exactly the
task for which our CNN is optimized. We use each image
in the test set of our datasets as a query and use it to
rank the K concepts by similarity. The similarity
between the word image embedding and the concept embeddings
is measured as the dot product, and we report mean
average precision. In concept-to-image retrieval, the goal is
to retrieve images given a query concept. The similarity
between the word image embeddings and the concept embedding
is also measured as the dot product. In this case, we
observed that $\ell_2$-normalizing the image features led to
significant improvements. The evaluation metric is also the mean
average precision. In image-to-image retrieval, we are
interested in using one image as a query and retrieving other
images that share at least one semantic concept. Images can
be represented by the output of the FC7 layer, which
corresponds to the latent space, but also by the output of the
last layer, which would correspond to an "attribute scores"
layer, where the image is represented by stacking the
similarities between the image and all K concepts. This is a
more challenging task, since two images that have many
different associated concepts but share one of them are still
considered a match. In this case, we report precision at k,
for values of k of 1, 10, and 50, and R-Precision, where the
number of relevant images per query is used as cutoff.
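For reference, the three ranking metrics mentioned above can be computed from a ranked list of relevance flags as in the following sketch. It uses the standard textbook definitions and assumes the ranked list contains every relevant item; the function names are illustrative and the exact evaluation code of the paper may differ in details such as tie handling.

```python
def average_precision(ranked_relevance):
    """Average precision of a ranked list of booleans (best item first)."""
    hits, total = 0, 0.0
    for position, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / position  # precision at this relevant item
    return total / hits if hits else 0.0


def precision_at_k(ranked_relevance, k):
    """Fraction of relevant items among the first k results."""
    return sum(ranked_relevance[:k]) / k


def r_precision(ranked_relevance):
    """Precision at R, where R is the number of relevant items."""
    r = sum(ranked_relevance)
    return sum(ranked_relevance[:r]) / r if r else 0.0


def mean_average_precision(queries):
    """Mean of the average precision over a list of ranked lists."""
    return sum(average_precision(q) for q in queries) / len(queries)
```

For the ranked list `[True, False, True, False]`, the average precision is (1/1 + 2/3) / 2, precision at 1 is 1.0, and R-Precision is 0.5 because only one of the top two results is relevant.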

We consider two baselines in our experiments. The first
one is the two-step approach based on transcribing the word
image and matching the transcriptions. For this task we
use a state-of-the-art dictionary CNN [11]. We use the
pretrained model that the authors made available. This model
achieves around 95% transcription accuracy on the Oxford
Synthetic dataset by choosing the right transcription out of a
pool of 90,000 candidates. In this baseline, we first use this
model to choose the most likely transcription of a given
image, and then we propagate concepts extracted from WordNet
using that transcription. This allows us to match an
image with concepts, and to perform both image-to-concept
and query-by-image retrieval using inverted indices.
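The second step of this baseline amounts to a dictionary look-up plus an inverted index. The sketch below illustrates the idea on hand-written toy data; the transcriptions stand in for the output of the dictionary CNN, the word-to-concept table stands in for the WordNet annotations, and all names are hypothetical.

```python
from collections import defaultdict


def image_to_concepts(transcription, word_concepts):
    """Concepts propagated from the predicted transcription, if any."""
    return word_concepts.get(transcription, set())


def build_inverted_index(transcriptions, word_concepts):
    """Map each concept to the list of images whose transcription has it."""
    index = defaultdict(list)
    for image_id, word in transcriptions.items():
        for concept in sorted(word_concepts.get(word, ())):
            index[concept].append(image_id)
    return index


# Toy annotations and toy recognizer output (img2 is mis-transcribed).
WORD_CONCEPTS = {
    "pizzeria": {"restaurant"},
    "phoenix": {"city", "mythical_being"},
}
TRANSCRIPTIONS = {"img0": "pizzeria", "img1": "phoenix", "img2": "pizzerla"}
```

In this toy data the mis-transcribed image `img2` receives no concepts and never appears in the index, which illustrates why the baseline depends so heavily on transcription accuracy.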

As a second baseline, we use the output activations of
the penultimate (FC7) layer of the same model as a feature
representation of the words (4,096 dimensions). This is a
very strong feature representation that encodes information
about the characters of the word. We denote it with
CNN-Dict7. These features can subsequently be used for
image-to-image retrieval, or for concept prediction after learning a
linear classifier on top.

Table 1: Image-to-image retrieval with K = 256 concepts.
We compare our features with the CNN-Dict7 features [11].

Level  Features      P@1    P@10   P@50   R-P
l = 7  CNN-Dict7     96.20  83.99  29.67   4.04
       LEWIS (FC7)   95.10  91.89  49.08  12.21
       LEWIS (FC8)   94.16  91.79  61.73  29.05
l = 8  CNN-Dict7     96.33  84.45  29.74   4.20
       LEWIS (FC7)   95.50  92.43  49.73  13.21
       LEWIS (FC8)   94.67  92.68  67.99

We also evaluate the effect of inaccurate cropping of the
word images. In most realistic scenarios involving
end-to-end tasks, it is necessary to localize and crop the word
images out of larger images. Even though localization
techniques have improved in recent years, localization is still
inexact at best. To test the effect of this, as a surrogate of text
localization, we perform random crops of the word images,
randomly removing up to 20% of the image from left and
right and up to 20% of the image from top and bottom. All
of these cropped images still have an intersection over union
with the originals greater than or equal to $(1 - 0.2)^2 = 0.64$,
and would be accepted as positive localizations using the
standard localization threshold of 0.5.
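The 0.64 bound follows because a crop lies entirely inside the original image, so the intersection is the crop and the union is the original. The sketch below checks this numerically. How the removed 20% is split between the two opposite sides is an assumption made here (a uniformly random split); the paper does not spell it out, and the function names are hypothetical.

```python
import random


def box_iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def random_crop(box, max_frac, rng):
    """Remove up to max_frac of the width and of the height of `box`."""
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    fx, fy = rng.uniform(0, max_frac), rng.uniform(0, max_frac)
    # Assumption: the removed fraction is split at random between sides.
    sx, sy = rng.uniform(0, 1), rng.uniform(0, 1)
    return (x0 + fx * sx * w, y0 + fy * sy * h,
            x1 - fx * (1 - sx) * w, y1 - fy * (1 - sy) * h)


original = (0.0, 0.0, 100.0, 32.0)
worst_case = (10.0, 3.2, 90.0, 28.8)  # 20% removed in each direction
```

Here `box_iou(original, worst_case)` evaluates to 0.8 × 0.8 = 0.64 up to floating-point rounding, and any crop produced by `random_crop(original, 0.2, rng)` has an IoU with the original of at least that value.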

4.4. Results and discussion 

Image-to-concept and concept-to-image retrieval tasks 

We first evaluate the proposed LEWIS approach on the
image-to-concept and concept-to-image tasks on Oxford
Synthetic and compare it with the two-step baseline. We
report results in Figure 4, (a) and (b). The bars represent
our approach, and the dots denote the two-step baseline.

We observe that increasing the number of levels (i.e.,
more fine-grained concepts) generally leads to improved
results. This is reasonable, as in the extreme case of one
concept per word this is equivalent to the transcription
problem, which we know can be addressed with a CNN. On
the other hand, less depth implies more words per semantic
concept. This leads to a more multimodal problem, where
very different images have to be assigned to the same class,
increasing the difficulty. Increasing the number of concepts
K has only a limited impact. As the concepts were sorted by
number of words assigned, the first concepts are more
difficult than the subsequent ones, leading to a trade-off:
there are more concepts, but the added concepts are easier.

Compared to the two-step baseline, our method is 
slightly behind (about 1 percent absolute) on the image-to- 
concept task. However, in the concept-to-image task, we 
outperform the baseline. We believe the concept-to-image 
task is less forgiving towards badly transcribed images, as 
they negatively affect several queries, and that the softer nature 
of the proposed embeddings makes them more suitable 
for the task. We also evaluate on images with random crops 
in (c) and (d). In this case, as expected, the recognition 
baseline fails, while our approach is still able to detect key 
aspects of the word image and favor the appropriate concepts. 
We also evaluate on the SVT and IIIT5K datasets 
(Figure 5), where the results exhibit a similar behaviour. 
Despite having learned the models on Oxford Synthetic, the 
results on SVT and IIIT5K are still very accurate. 

We believe that learning the features with a deep archi¬ 
tecture focused on the semantics is a key factor in our ap¬ 
proach. This is demonstrated through the comparison with 
the second baseline: we use the state-of-the-art CNN-Dicty 
features [11] and learn a linear classifier using the same 
ranking loss and data we use to train LEWIS for a fair 
comparison. In this case, we only achieve a mean average 
precision of 59% on the image-to-concept task 
(K = 128, l = 7), compared to the 95% achieved when 
learning the features with the semantic goal in mind. This 
shows that, to perform these types of tasks, traditional or 
recent word image features that only encode character in¬ 
formation are not suitable, and that it is necessary to learn 
and encode these semantics directly in the representation. 

Image-to-image retrieval We now focus on the image- 
to-image task, where one image is used as a query and the 
goal is to return all the images that are related, i.e., that 
have at least one concept in common at a given level. We 
compare the proposed LEWIS features, extracted from the 
previous-to-last layer (FC7, 4,096 dimensions) and the last 
layer (FC8, K dimensions), with the CNN-Dicty features 
of Jaderberg et al. [11]. All features are ℓ2-normalized and 
compared using the dot product. 
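
The comparison step described above can be sketched as follows. This is an illustrative example only: the function names l2_normalize and rank_by_dot_product are hypothetical, and the two-dimensional vectors stand in for the real 4,096- or K-dimensional features. After ℓ2 normalization, the dot product equals the cosine similarity.

```python
import math

def l2_normalize(vec):
    """Scale a feature vector to unit Euclidean norm (zero vectors are kept)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def rank_by_dot_product(query, database):
    """Return (order, scores): database indices sorted by decreasing similarity.

    Both the query and the database features are l2-normalized first, so the
    dot product used as the score is the cosine similarity.
    """
    q = l2_normalize(query)
    scores = [sum(a * b for a, b in zip(q, l2_normalize(d))) for d in database]
    order = sorted(range(len(database)), key=lambda i: scores[i], reverse=True)
    return order, scores

if __name__ == "__main__":
    # Toy features (assumed values, for illustration only).
    order, scores = rank_by_dot_product([1.0, 0.0],
                                        [[0.0, 2.0], [3.0, 0.0], [1.0, 1.0]])
    print(order)  # indices of database items, most similar first
```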

Table 1 reports results for K = 256 at several levels using 
precision @1, @10, @50, and R-Precision as metrics. 
At precision @1, the CNN-Dicty features obtain superior 
results, as they are returning another image with exactly the 
same word and they do this with better accuracy. However, 
as k increases and images with different words need to be 
returned, their accuracy plummets, as this representation only 
encodes information to recognize the exact word. On the 
other hand, our embeddings still return meaningful results 
when k increases, even if they have not been learned explic¬ 
itly for this type of retrieval task. 
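
For reference, the metrics used in this comparison can be sketched as below. This is a minimal sketch using the standard definitions of precision at k and R-Precision, with relevance defined as in the text (two images are related if they share at least one concept at the given level). The function names and the toy concept sets are assumptions made for illustration.

```python
def is_relevant(query_concepts, item_concepts):
    """Two images are related if they share at least one concept."""
    return bool(set(query_concepts) & set(item_concepts))

def precision_at_k(ranked_relevance, k):
    """Fraction of the top-k returned items that are relevant."""
    return sum(ranked_relevance[:k]) / k

def r_precision(ranked_relevance, num_relevant):
    """Precision at R, where R is the number of relevant items for the query."""
    if num_relevant == 0:
        return 0.0
    return sum(ranked_relevance[:num_relevant]) / num_relevant

if __name__ == "__main__":
    # Toy example: concept sets of the query and of the items, in ranked order.
    query = {"food", "dish"}
    ranked_items = [{"food"}, {"dish", "meal"}, {"vehicle"}, {"food"}, {"place"}]
    rel = [1 if is_relevant(query, item) else 0 for item in ranked_items]
    print(precision_at_k(rel, 1), precision_at_k(rel, 3), r_precision(rel, sum(rel)))
```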

Qualitative results Figure 6 illustrates some qualitative 
results for the image-to-concept task, using K = 128 
classes and depth levels 7 or 8. In many cases, the predicted 
concepts are closely related to the query even if they do not 
appear in the ground truth annotations, showing that semantically 
similar concepts are being embedded in neighboring 
locations of the space. Figure 7 shows qualitative results for 
the concept-to-image task, showing once again that images 
with very different transcriptions are still embedded close to 
their related concepts. Interestingly, we can combine concepts, 
by adding or subtracting the scores, to make more 
complex searches that still return meaningful results. 

Figure 6: Qualitative results on the image-to-concept task with K = 128 and concepts from levels 7 and 8. Many of the top 
predicted concepts are meaningful even if they do not appear amongst the ground truth ones. 

Figure 7: Qualitative results on the concept-to-image task with K = 128 and concepts from levels 7 and 8. For every concept, 
we show images of unique words returned in the first positions. No negative image was ranked ahead of any of these images. 
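
The combination of concepts by adding or subtracting scores can be illustrated with a small sketch. It is not the authors' implementation: the function combined_search is hypothetical, and the per-image concept scores below are made-up numbers chosen only to show the mechanics.

```python
def combined_search(image_scores, add=(), subtract=()):
    """Rank images by the sum of scores for `add` concepts minus the sum
    of scores for `subtract` concepts.

    image_scores maps an image name to a dict {concept: score}; a missing
    concept is treated as a score of 0.
    """
    def combined(scores):
        plus = sum(scores.get(c, 0.0) for c in add)
        minus = sum(scores.get(c, 0.0) for c in subtract)
        return plus - minus

    return sorted(image_scores, key=lambda name: combined(image_scores[name]),
                  reverse=True)

if __name__ == "__main__":
    # Hypothetical scores, for illustration only (not results from the paper).
    image_scores = {
        "burger": {"food": 0.9, "restaurant": 0.6},
        "bistro": {"food": 0.5, "restaurant": 0.9},
        "garage": {"food": -0.2, "restaurant": 0.0},
    }
    # "food but not restaurant": add one concept score, subtract the other.
    print(combined_search(image_scores, add=["food"], subtract=["restaurant"]))
```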

Generalization One of the advantages of our method 
with respect to the baseline is that we can encode and find 
concepts for words that have not been observed during train¬ 
ing or that do not appear in WordNet. The previous ex¬ 
perimental results hinted that some generalization has been 
achieved. For example, the qualitative results of Figure 6 
showed that some concepts were predicted based on the 
roots of similar words, as those concepts did not appear in 
the ground truth of the words. This is consistent with the 
results using random crops, where reasonable results were 
obtained even if part of the word was missing. Here we 
test this explicitly by training a network on a subset of the 
training data (90% of words) and testing on a disjoint set 
(10% of words), where none of the testing words were ob¬ 
served during training. In this case, the results dropped from 
around 90% down to 56.1% (K = 128, l = 7) and 61.9% 
(K = 256, l = 7) in the image-to-concept task, and to 40.6% 
and 52.8% in the concept-to-image task. Although there is a significant 
drop in accuracy, the results are still surprisingly 
high, given that this is a very arduous zero-shot problem. 
This shows that some generalization to new words is indeed 
achieved, likely through common roots of words. 
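
The word-disjoint protocol used in this experiment can be sketched as follows. This is an assumed reconstruction of the split only: the function split_by_word, the (image_id, word) sample format, and the fixed seed are hypothetical choices for illustration. The key point is that the split is made over distinct words, not over images, so no test word is ever seen during training.

```python
import random

def split_by_word(samples, test_frac=0.1, seed=0):
    """Split (image_id, word) samples so that train and test share no word.

    The split is drawn over the set of distinct words: roughly test_frac of
    the words (at least one) go to the test set together with all of their
    images.
    """
    words = sorted({word for _, word in samples})
    rng = random.Random(seed)
    rng.shuffle(words)
    n_test = max(1, round(test_frac * len(words)))
    test_words = set(words[:n_test])
    train = [s for s in samples if s[1] not in test_words]
    test = [s for s in samples if s[1] in test_words]
    return train, test

if __name__ == "__main__":
    # Toy data: 20 distinct words with 3 rendered images each.
    samples = [(f"img_{w}_{i}", f"word{w}") for w in range(20) for i in range(3)]
    train, test = split_by_word(samples)
    print(len(train), len(test))
```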

5. Conclusions 

In this paper we have introduced a new task to the com¬ 
puter vision community: predicting relevant semantic cat¬ 
egories for word images. We believe that solutions to this 
task can greatly benefit problems related to scene text, par¬ 
ticularly urban scene understanding. To address this new 
task, we propose an approach based on CNNs that learns 
to rank semantic concepts in an end-to-end manner, start¬ 
ing directly from the image pixels. The proposed approach, 
LEWIS, can be understood as learning an embedding space 
shared by both word images and semantic concepts. LEWIS 
performs similarly to or outperforms a two-step baseline 
based on a state-of-the-art word transcription method on a 
variety of tasks, while offering significant advantages. We 
also generated semantic annotations for an existing large- 
scale word image database, which we will share with the 
community to help further research on this task. 